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Abstract 

Sequential Monte Carlo (SMC) approaches have become work horses in approximate Bayesian 
computation (ABC). Here we discuss how to construct the perturbation kernels that are required 
in ABC SMC approaches, in order to construct a set of distributions that start out from a suit- 
ably defined prior and converge towards the unknown posterior. We derive optimality criteria for 
different kernels, which are based on the Kullback-Leibler divergence between a distribution and 
the distribution of the perturbed particles. We will show that for many complicated posterior 
distributions locally adapted kernels tend to show the best performance. In cases where it is possi- 
ble to estimate the Fisher information we can construct particularly efficient perturbation kernels. 
We find that the added moderate cost of adapting kernel functions is easily regained in terms of 
the higher acceptance rate. We demonstrate the computational efficiency gains in a range of toy- 
examples which illustrate some of the challenges faced in real-world applications of ABC, before 
turning to a demanding parameter inference problem for a dynamical system, which highlights the 
huge increases in efficiency that can be gained from choice of optimal models. We conclude with a 
general discussion of rational choice of perturbation kernels in ABC SMC settings. 

1 Introduction 



Statistical practice and theory tend to reflect scientific fashions (Stigler, 1986). Today mathematical 
models in physics, engineering, biology, but also the social and engineering sciences are becoming 
increasingly complex. This, together with the deluge of data being produced in many fields, poses 



severe challenges to statistical inference (Efron 2010). In particular evaluation of the likelihood (Cox 



2006) 



L{9) = f(x\9), 

where x are realizations of the data, and is the (potentially vector- valued) parameter characterizing 
the data-generating process, is often turning out to be impractical. Approximate Bayesian Compu- 



tation (ABC) methods (Beaumont & Zhang, 2002; Marin et at, 2011) were first conceived to allow 



(Bayesian) statistical inference in situations where the evaluation of the likelihood is too complicated 



(Pritchard et al. 


1999 


Tanaka 


2006 


Lopes & Beaumont 


2010) 



than evaluating the likelihood directly, ABC-based approaches use systematic comparisons between 
real and simulated data in order to arrive at approximations of the true (but unobtainable) posterior 
distribution, 

p(9\x) oc /O#)tt(0), 
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where it (6) denotes the prior distribution of 8. 

Simulating from f(x\8) is generally straightforward, even if obtaining a reliable numerical/functional 
representation of the model is not possible. We then compare the simulated data, y, with the real 
data, x, and accept only those simulations where some distance measure, A(x,y), between the two 
falls below a specified threshold, e. If the data are too intricate or complicated it is common to replace 
a comparison of the real and simulated data, by a comparison of suitable summary statistics. This 
results in a often appreciable reduction of the dimension but is fraught with problems if the summary 



statistics are not sufficient. Given that sufficiency (Cox, 2006) is a rare quality indeed (Lehmann & 



Casella, 1993), and probably not given for any real- world problem of scientific interest, this problem is 



now attracting a lot of attention (Robert & Cornuet, 2011; Didelot et al., 2011 Fearnhead & Prangle 



2010). Here, however, we shall focus on the data directly; we thus seek to determine approximate 



p(8\x) « p e {8\x) oc / f(y\8) 1 (A(x, y) < e) n(8)dy, 



posteriors of the form, 



where y is the data simulated from the model f(-\8) for a given parameter, 8, drawn from the appro- 
priate prior distribution, and x is the observed data. 

The simple ABC scheme outlined above suffers from the same shortcomings as other rejection 
samplers: most of the samples are drawn from regions of parameter space, which cannot give rise to 
simulation outputs that resemble the data. Therefore a number of computational schemes have been 
proposed that makes ABC inference more efficient. These come in loosely three flavours: regression- 



Carlo ABC schemes (Marjoram & Molitor 



adjusted ABC (Tallmon 2004 Fagundes et al, 2007; Blum & Frangois 2009), Markov chain Monte 



2003 



Ratmann et al., 2007), and ABC implementing some 



variant of sequential importance sampling (SIS) or sequential Monte Carlo (SMC) (Sisson et al, 2007 



Toni et al, 2009, Beaumont et al, 2009 Del Moral et al, 2008). Of these the first and last forms have 



received the greatest attention and it is an ABC scheme based on sequential importance sampling 
that we will focus on as it offers greater flexibility and applicability and appears to be enjoying greater 
popularity in applications. 



We focus on the implementation of Toni et al. (2009), which like other related SIS and SMC meth- 



ods works by constructing a series of intermediate distributions that start out from a suitably specified 
prior distribution and increasingly resemble the (unknown) approximate posterior distribution. These 
intermediate distributions are defined by 



Pet (8)* / f(y\8)t(A(x,y)<e t )7r(8)dy , 



for 1 < t < T and e\ > ei > 
of N parameter vectors, 8^ %,t \ 



. . > €t = e. Each intermediate distribution is described by a sample 
1 < i < N and their corresponding weights, a/*'*-*, which together we 



refer to as particles (^ ,,f ) ) ( l j( ,,t '). Successive distributions are constructed by perturbing parameters 
through some kernel function, 8 ~ K t (-\8), generating simulated data, y ~ f(-\8) and, upon acceptance, 
calculating the corresponding new weights. Below we will develop this framework in more detail. 

While this sequential ABC approach is computationally much more efficient than simple ABC 
rejection schemes, the overall computational burden does not only depend on the complexity of the 
model and the amount of data at hand, but also on details of the chosen sequential scheme. In 
particular the e-schedule, {ei, . . . , ex}, and the choice of perturbation kernels, Kt(-\-) exert considerable 



influence on the algorithmic complexity. As in all Monte Carlo settings (Gilks & Richardson 1996 



Robert , 2004 ) problems tend to arise as the dimension of the parameter space increases and balancing 



convergence with the necessary maintenance of exhaustive exploration of the parameter space becomes 
harder. 
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The construction of suitable kernel functions has been a longstanding problem in the context 



of conventional Markov chain Monte Carlo analysis (Givens 



1996 



2008; Cornebise et al, 2008; Cornuet et al, 2009 Girolami & Calderhead, 2011). In the SMC setting 



Douc et al. 2007 Cappe et al. 



the first strategies to adapt the proposal kernel were early as (Pitt & Shephard, 1999), with further 



appealing but ad- hoc methods proposed in Van Der Merwe et al. (2001); but it is still far from being 



solved. Some recent and more theoretically based approaches including Cornebise et al. (2008, 2011); 
but especially in an ABC context formal and informal understanding of kernel choice remain areas of 
pressing concern. 

Especially for models which are computationally expensive to simulate, such as dynamical systems, 
the likelihood/posterior surfaces (Gutenkunst et al. , 2007 Secrier et al. , 2009; Erguler & Stumpf 2011 ) 
suggest that the choice of the kernel will have huge influence on the efficiency with which parameter 
spaces are explored and posterior estimates obtained. Here we will discuss a range of kernel functions, 
characterize their performance, and put forward some analytic results as to their optimality. In the 
next section we discuss the ABC scheme in some detail before describing criteria for optimally choosing 
the perturbation kernels and outlining different classes of perturbation kernels. We then examine the 
performance of these kernels in applications to a range of illustrative problems and, compare their 
algorithmic complexities. We will show that for many problems with complex posterior parameter 
distributions the choice of suitable kernels can vastly improve the computational cost of ABC SMC 
inferences. 



2 The ABC SMC algorithm 

The general scheme of ABC inference is as follows: 

• sample a parameter vector 9 (also called particle) from the prior distribution n(0), 

• simulate a dataset y according to the generative model f(y\9), 

• compare the simulated dataset with the experimental data x: if A(x, y) < e, accept the particle. 
The N accepted particles form a sample from the posterior distribution 

t(A(x,y)<e)f(y\6)Tr(6)dy 



Pe(9\x) 



which is an approximation of the posterior distribution p{9\x). 

Over the past few years many improvements of these algorithms have been proposed. In particular, 



Marjoram Sz Molitor (2003) introduced a method based on Markov chain Monte Carlo, which consists 
in constructing a Markov chain whose stationary distribution is p e (9\x). To do so, at each time t, a 
particle 9 is simulated from the previous particle according to a perturbation kernel 

(or from the prior distribution for t = 0); the simulated data y ~ f('\&) is compared with the 
experimental data; and 0w is set to be equal to 9 with a Metropolis Hasting acceptance rate if 
x) < e and to 9^^ otherwise. This algorithm is guaranteed to converge, however it is very 
difficult to assess when the Markov chain reaches the stationary regime; furthermore the chain may 
get trapped in local modes. 



SIS and SMC samplers have then been introduced in the ABC framework by several authors ( Sisson 



et al. 


2007 


Toni et al. 


2009; 



Beaumont et al. 2009, Del Moral et al., 2008). These methods aim to 



sample sequentially from a sequences of distributions, which increasingly resemble the target posterior; 
they are constructed by estimating intermediate distributions p tt (9\x) for a decreasing sequence of 
{tt}i<t<T- The scheme of the algorithm is as follows: first, the ABC algorithm described above is 
used to construct a sample from p ei (9\x) with a sufficiently large value of ei such that many particles 
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are accepted. The ABC algorithm is then used again with e = £2; but instead of sampling parameters 
from the prior, they are sampled from the set of accepted particles at the previous stage and perturbed 
according to a suitable perturbation kernel. This way a sample from p e2 (9\x) is built, and so on until 
for t = T our target posterior has been arrived at. In this article, we focus on the implementation 
of Toni et al. (2009) described in Algorithm [lj from a decreasing sequence of {et}i<t<T and a set 
of perturbation kernels {i?t(-|-)}i<t<T (see also iToni & Stumpf (2009a)), the algorithm generates a 



weighted sample of particles from p £T (9\x). In the following we will refer to this implementation as 



the ABC SMC algorithm according to Toni et aZ.| ( |2009| ). The ABC Population Monte Carlo (PMC) 
algorithm proposed by Beaumont et al. (2009) is similar to the ABC SMC algorithm, except that 



a specific perturbation kernel is used. It is, however, worth distinguishing between these algorithms 



and the one of Del Moral et al. (2008) and Drovandi fc Pettitt (2011) based on the SMC sampler 



of Del Moral & Doucet (2006). When using SMC samplers, both a forward and a backward kernel 
need to be defined, which reduces the algorithmic complexity from 0(N 2 ) to O(N) where N is the 
number of particles. However, as we will argue later, in many applications of interest, the most 
computationally expensive part of an ABC algorithm is the simulation of the data which is usually 
much larger compared to 0(N 2 ). We will therefore not discuss kernel choice in the context of these 
approaches, although here, too, the choice of kernel will impact the numerical efficiency. 



Algorithm 1 ABC SMC algorithm 



input: a decreasing sequence of (et)i<t<T such that €t = e, a data x, a sequence of {K t {-\-))i<t<r 



2 
3 
4 
5 
6 
7 
8 
9 

10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 



21 

22 
23 
24 



output: a weighted sample of particles from p tT {9\x) 
for all 1 < t < T do 

determine the parameters of the perturbation kernel Kt(-\-) 

i <- 1 

repeat 

if t=l then 

sample 9 from ir(9) 
else 

sample 9 from the previous population {9^' t ^ 1 ^}i<i<N with weights {uj( l ' t ~ 1 ^}i<i<N 
sample 9 from Kt{-\9) and such that ir(9) > 
end if 

sample y from f(-\9) 
if A(y, x) < et then 
0(vO <_ 

i «- i + 1 
end if 
until i = N + 1 

calculate the weights: for all 1 < i < N 
if 1 then 



else u^' 1 ' ^— 1 
end if 

normalize the weights 
end for 
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3 Properties of optimal kernels 



The behaviour of the algorithm depends on its settings: in particular the decreasing sequence of 
{tt}i<t<T and the perturbation kernels {K t (-\-)}2<t<T- The effect of the sequence of decreasing 
threshold is easy to understand: if the difference between two successive tolerances et and et+i is 
small, the posterior distributions p tt (0\x) and p et+1 (9\x) are similar and a small number of simulations 
will be required to generate N draws from the next intermediate distribution, p e (0\x), by sampling 
from the weig hted population (0(^-1) 

i<i<iV- But a slowly decreasing sequence of thresholds 
{tt}i<t<T leads to a large number of iterations (large value of T) in order to obtain ex = e. Similarly, 
the choice of the perturbation kernels {-Kt(-|-)}i<t<T exerts considerable influence on the computa- 
tional complexity of the algorithm. A local perturbation kernel hardly moves the particles and has the 
advantage to produce new particles which are accepted with high probability if the successive values 
of e are close enough; on the other hand, a widely spread out or permissive perturbation kernel enables 
exploring the parameter space more fully, but does so at the cost of achieving only low acceptance 
rates. 

In sequential importance sampling, a perturbation kernel Kt should fulfill several requirements to 
be computationally efficient. In particular, the joint proposal distribution, corresponding to picking 
a particle at random and perturbing it to obtain a new particle, should "resemble" in some sense 
the target joint distribution, corresponding to picking independently two particles. More precisely, 
the joint proposal distribution of a particle, which samples first a particle 9^ t ~ l " > ~ p €tl (-\x) then 
a perturbed particle 9^ ~ Kt{-\9^~ 1 ^), and accept the couple if and only if A(y,x) < et where 
V ~ /('l^ )> admits for density 

et ' uet ' a(K t ,e t -i,e t ,x) 

The normalization factor 

a(K u e t -i,e u x) = [[[ p et _Ae {t - 1] \x)K t {9^\9^)f{y\9^) 1 (A(x,y) < e t )d9^d9^dy (1) 



is the average acceptance probability, that is, the proportion of proposed particles that will not be 
rejected. This joint proposal distribution should "resemble" in some sense the target product distri- 
bution, that of sampling and 9® independently from, respectively, p et _ 1 (-\x) and p et (-\x), whose 
density is 

d,^' -1 ^®!*) =p et - 1 {e< t -»\x)p et (e®\x) . 

As argued by several authors, e.g. |Douc et ~al] ( 2007) ; |Cappe et al. (2008); Cornebise et al. (2008); 



Beaumont et al. (2009), a mathematically convenient formal definition of this "resemblance" is the 
Kullback-Leibler (KL) divergence between the proposal distribution q et _ lt e t 

(flC*- 1 ), #(*)) an d the target 

distribution <_ ljet (6' ( *~ 1) , 9®), i.e. 



KL{q et _ lM \ql t _ ltet ) 



( e (t-i) e (t)\ ^ g£-i,*t ( 0t 1 ' 6 f) ) de (t-l) de {t) 



which can be separated into three terms 

KL(q et _ 1>et ;ql t _ 1>et ) = -Q(K t ,e t -i,e t ,x) +loga(K t ,e t -i,et,x) + C(e t -i, e t , x) , (2) 

where 

Q(K t ,e t ^,e t ,x) = jj p^^x^^x) log K t (9^\9^)d9^d9^ , (3) 
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can be maximized easily (hence minimizing —Q) in some convenient cases (see Section]!]), a(Kt, et—i, et, a 
is the average acceptance probability already defined in ([!]), which is much harder to minimize, and 
C(ef_i, et, x) does not depend on the kernel Kt and can therefore be ignored. This provides a rational 
criterion for choosing optimally adapted perturbation kernels: we want a kernel Kt that minimizes 
the quantity 



Beaumont et al. (2009) follow a similar line of reasoning, albeit considering only the asymptotic 
case where tt-\ = £t = 0. Therefore, they minimize KL(qo ! o; q$ ), and, since a(K t , 0, 0, x) = 1 for any 
Kt, they would ideally want to find the kernel that maximizes Q{Kt, 0, 0, x). Because it is impossible 
to solve this problem in the asymptotic case, they revert back to the non-asymptotic case by setting 
et = et-i for both thresholds, and therefore eventually solve 

argmax^ Q{K t , e t -i,e t -i,x) , 

which admits a simple solution for certain adequately chosen families of kernels. While they restrain 



themselves to the component- wise Gaussian kernel (see Section 4.1), their approach tends to work 
well. 

However, a close study of Equation ^ shows that this approach does not actually minimize the real 
Kullback-Leibler divergence in the non-asymptotic case, as we would hope to. Eventually minimizing 
this KL divergence would require two modifications: evaluate Q(K t ,et-i,et,x) with the two distinct 
thresholds et~\ ^ et, as we easily do in Section [4]- it simply amounts to replacing a covariance by a 
cross-covariance in the solution - and, much more troublesome, take into account log a(K t , et-\, et, x). 
This second point is by and large impractical; there is no closed-form (that is, easily computable) solu- 
tion to maximize the acceptance-rate. Therefore, our original goal of minimizing KL(q et _ ljet ;q* ) 
per se would seem unattainable. 

Starting from we have that 

Q(K t ,et-i,e t ,x) = -KL(q et _ u e t ;qt t _ u e t ) + loga(K t ,e t -i,e t ,x) + C(e t -i, e t , x) . (4) 
The two following maximization problems are therefore equivalent: 

argmax^ Q(K t ,e t -i,e t ,x) = argmax^ (-KL(q et _ 1;6t ; g* t _ 1)£ J + loga(K t , e t -i, e t , xj^ . (5) 

As we mentioned and as we will show in Section [4] this problem is easy to solve, since the left-hand 
side often admits a closed-form solution. The most important remark is that the right-hand side is 
the solution of a multi-objective optimization problem, solving a trade-off between jointly minimizing 
the Kullback-Leibler divergence and maximizing the logarithm of the average acceptance probability. 
Multi-objective optimization by using an additive combination of two distinct objective functions is a 



common practice, see e.g. Section 4.7.5 of Boyd & Vandenberghe (2004). Both of those properties are 



actually wished for in an efficient ABC SMC proposal kernel. Besides, although the weights of such 
additive combination are here forced upon us, we note that the use of the logarithm of the acceptance 
probability strongly penalizes very low probabilities, while making equally desirable moderate to large 
acceptance probabilities, a reasonable preference from the computational point of view. 

The practical consequences of this rather theoretical argumentation are therefore threefold. We 
suggest to choose the proposal kernel 

K t = argmaxj^ Q(K t , e t -i,e t , x) (6) 

where Q is defined in equation ^ . We have shown that this choice corresponds to a trade-off between 
two desirable properties of the kernel, namely the resemblance of the proposal distribution and the 
target in the sense of the KL divergence, and a high acceptance rate. And our criterion not only sheds 
new light on the justification of some existing, proven criteria, but, additionally, refines them. 
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4 Optimal choice of random walk kernels 



Perturbing a parameter consists of sampling a new particle according to a probability parametrized 
by and often centred on 6. In the ABC SMC algorithm, in addition to sampling from the kernel, 
we must be able to compute the transition density Kt(6^' t ^\9^' t ~ 1 ^) for any particles and flCW -1 ), 
Then, instead of choosing the perturbation kernel Kt that maximises equation ^ over all possible 
kernels, the space of possible kernels is often restrained to a parametric family from which it is easy to 
sample and perform optimization. Different probability models may be used, with the most common 



Sisson et al. 


2007, 


Toni et al. 


2009 


Liepe et al. 



2010). In the following, we outline different classes of Gaussian perturbation kernels and compare 



their efficiency in terms of acceptance rates and computational cost. 
4.1 Component- wise perturbation kernel 

In most cases the particle is moved component-wise: for each component 1 < j < d of the parameter 
vector 6 = (6*i, • • • 6d), 0j is perturbed independently according to a Gaussian distribution with mean 6j 



Didelot et al. 



2011 



previous population 
a kernel of the form 



and variance aj. The parameters {o~j}i<j<d may be fixed in advance, but more frequently (Beaumont 
eTaT[ [20091 IMcKinley et al.\ [20U9l |Toni fc Stump!! |2009b[ |Jasra et al.j |20l0l |Barnes et al.j |201l " 
adaptively set parameter choices Wf^}i<j<d are used which depend on the 
- the scale or variance is then indexed by the population index, t. Considering 



n 

i=i 



j(t) 



g(*-l)\2 
3 ' 



(7) 



and maximizing Q(Kt,0, 0,x) — or more precisely Q(Kt,et-i,et-i,x) — Beaumont et al. (2009) 



showed that the optimal value of crj is twice the variance of the j-th component in the previous 
population: 



,(*) 



2Var({#f'*- 1) } 1 < fc < iV ) 



Maximizing Q(Kf, et—i, e$, x), however, leads to a slightly different choice of which is 



a 



(*) 



E n 



j(t) 



j(*-l)\2 



■j -vn.^p^x) - », ) ■ (8) 

This quantity can be computed from the set {^^' i_1 \w^'*~ 1 ),y^' t_1 )}i<j<Ar , where is the 

observation simulated according to as follows 

N N 



i=l k=l 



a(M-l)\2 



(9) 



We 



denote by {6^}i<fc<Ar the set of particles X ^ s.t. A(x,y^ t ' t ^) < e t , l<i< Nj and by 



{(^ J}i<fc</Vo the associated weights normalized such that ^ 



No ~{k) 



k=l' 



1. 



Another commonly used component- wise kernel is the uniform kernel, which consists of perturbing 



(*)., 

j ^3 



a 



(t) 



with density l/2a 



the j-th component of particle 9 to any value in the interval 9j — a 

A natural choice is to set the parameter op to the scale of the previous population, that is 



(*) 



a 



it) 



max j^'* 1 H — min j^I;-' 1 ''' 1J | 



(M-l)n 



Note that the main difference between the uniform and the component-wise normal kernels concerns 
their support: finite vs. infinite. 
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Figure 1: A population of particles and isodensity curves for a uniform kernel (left), a component-wise 
normal kernel (centre) and a multivariate normal perturbation kernel (right) around one particle (red 
point). 

4.2 Multivariate normal perturbation kernels 

Consider a population of two-dimensional parameters that are highly correlated. The perturbation 
of a particle according to the uniform kernel consists simply in sampling a parameter uniformly in a 
rectangle whose sides are parallel to the axes (see Figure [l] left). Similarly, the density levels of the 
component-wise normal kernel are ellipsoids whose principal axes are parallel to the parameter axes 
(see Figure [T] center) . But for highly correlated parameters the use of those two perturbation kernels 
in the ABC SMC framework leads to a small acceptance rate as they only poorly reflect the structure 
of the true posterior. 

Instead of using a component-wise kernel, it may thus be more efficient to take into account the 
correlations between the different elements of the parameter vectors, in effect perturbing the particles 
according to a multivariate normal distribution with a covariance matrix S^, which depends on the 
covariance of the previous population. Figure [l] (right) represents a multivariate normal perturbation 
kernel for a d dimensional square matrix £W proportional to the covariance of the previous population. 
We observe that fewer particle proposals are likely to be rejected with this perturbation kernel than 
with the uniform or component-wise normal one. 

The multivariate normal perturbation kernel relies on the covariance matrix which depends 
on the previous population. As before, it is possible to calculate the optimal covariance matrix T,® 
using the Kullback-Leibler divergence minimization approach. Maximizing equation ^ for 

K t {0®\e<t-V) = (27r)- d / 2 (detsW) _1 exp|-^ (o® - O^f (s^)" 1 - 0®) 

with respect to leads to maximizing the real- valued function 

g(M) = logdet (M) - E Ph _ i{ .\ x)Ph{ .\ x) [{9® - 9^) T M{0® - 0^ 

with respect to the symmetric d x d matrix M, and defining = M _1 . We denote by v T the 
transpose of vector v. Taking the partial derivatives of the function g : U dxd — y R J we obtain, for all 
1 < i, 3 < d, 



(M-% - — Tr (MC) = (M~ l )ij - C 3i = (M -1 )^- - Cy . 
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where Tr(M) denotes the trace of the matrix M, and the symmetric matrix C is equal to 



C = E„ 



(0(*)_0(t-l))(0(t)_0(*-l))5 



Therefore the covariance matrix of the optimal kernel in the multivariate Gaussian family is 
the matrix C defined above. Similarly to the component- wise case, if et = et-i then the optimal 
covariance matrix is equal to 2Cov({0( fc >*)} 

i<fc<iv)- I n the general case, however, an optimal choice of 
the covariance matrix El^ for the multivariate normal perturbation kernel is then 



£(*) 



N N 

EE 

i=l k=l 



Q(i,t-l)\f(j(k) _ g(i,t-l)yj 



(10) 



where {0^}i<k<N an d {^^}i<A<JVo are defined as previously. In the following, if nothing else is 
specified, the multivariate normal kernel refers to the kernel with this choice of covariance matrix. It 



is worth noting that Beaumont et al. ( 2009 ) restrained their optimality criteria to cases where either 



only a single parameter is to be estimated or where the covariance matrix is diagonal. Our scheme 
generalises their criterion to a much broader class of kernels. 



4.3 Local perturbation kernels 

Let us now consider one of the canonical examples of a posterior distribution which poses challenges to 
simple kernels: the so-called banana shape distribution in two dimensions. In this case the components 
of the parameters are highly correlated but the multivariate normal and the component-wise normal 
kernels discussed above behave similarly (see Figure [2] left). Indeed, the covariance matrix based on all 
the previous particles is nearly diagonal. It yields only limited information about the local correlation 
among the individual components of the parameter vectors. In such cases it is interesting to consider 
the use of a local covariance matrix which differs between particles. In the following we discuss three 
local perturbation kernels for which the particle 9 is perturbed according to a multivariate normal 
kernel whose covariance matrix si is a function of 9. 



4.3.1 The multivariate normal kernel with M nearest neighbours 

The multivariate normal kernel with M neighbours follows this principle: for each particle 9 6 
{0('-*),l < I < N}, the M -nearest neighbours of 9 are selected, and the perturbed particle is sampled 
according to a multivariate normal distribution of mean 9 and of covariance si* w estimated from the 
M neighbouring particles. 

The main drawback of this perturbation kernel is that the parameter M typically needs to be fixed 
in advance before any of the intricacies of the posterior are known. Figures [2] (centre) and [2] (right) 
illustrate the effect of the parameter M on the perturbation kernel. Using too small a value may lead to 
a lack of exploration of parameter space, while too large a value of M would offer little or no advantage 
compared to the standard multivariate normal kernel. Ideally, a mixture of multivariate normal kernels 
with different values of M should be used; however, in practice, this solution is computationally too 
expensive. 



4.3.2 The multivariate normal kernel with optimal local covariance matrix 



The theoretical calculation of the optimal covariance matrix above (see Section 4.2) may be adapted 
to identify an optimal local covariance matrix. Considering that the covariance matrix si depends 
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Figure 2: A population of particles and isodensity curves of a multivariate normal perturbation kernel 
with repectivly a covariance based on the whole population (left), on the 50 nearest neighbours (centre) 
and on the 200 nearest neighbours (right) around one particle (red point). The nearest neighbours 
are represented by yellow points. 



on the particle 9, and following the same steps as before, it is easy to show that the covariance matrix 
for a particle 6>(*~ 1 ) ~ Pet-i('\ x ) maximizing equation ^ is such that 



E, 



O-i) 



E 



P H (-\x) 



(0(*)_0(*-i))(0(t)_0(t-i); 



T 



. 



In particular, the set of covariance matrices such that, for all t and each particle 0^'* 1 \ 1 < j < N, 



J 0(i,*-i) 



E 



P H (-\x) 



satisfies the above condition. The multivariate normal perturbation kernel whose covariance matrix 
is equal to T,^ t _ r) as defined above will be referred to as the multivariate normal kernel with OLCM 
(where OLCM stands for optimal local covariance matrix) . We now have a different covariance matrix 
^(t-i) fo r eacn particle 0(-?' t_1 ) of the previous population. 




Figure 3: The eigenvectors of I 1 (6) (red arrows) of size proportional to the eigenvalues for a particle 
9 (red point). 
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4.3.3 Perturbation kernel based on the Fisher information 



The final perturbation kernel considered here uses information from the generative model via the 
Fisher Information Mat 
(FIM) defined as 



Rao 


1945; 


MacKay 


2003 


Cox, 


2006) 



1(9) = -E x 



dd 2 



log/(X|0) 



measures the amount of information that the observable random variable X carries about the pa- 
rameter 9. As previously mentioned the ABC algorithm is mainly used when the likelihood function 
f(-\9) is not known, and so the Fisher Information Matrix can not be computed exactly. Neverthe- 



less, Komorowski et al. (2011) have developed a method that evaluates the FIM for deterministic and 
stochastic dynamical systems represented by ordinary differential equations (ODEs) and stochastic 
differential equations (SDEs) using the linear noise approximation, which can be applied in such cases 
on the fly as the ABC SMC procedure progresses. 

In the Laplace expansion, the eigenvectors and eigenvalues of the inverse of the FIM 1(9) map out 
ellipsoidal levels of equal density around the parameter 9. The eigenvectors of I~ l (9) (scaled such 
that their lengths are proportional to their corresponding eigenvalues) are represented for two different 
values of 9 in Figure [3j The directions of the eigenvectors and the relative size of their eigenvalues 
are both relevant for perturbing the parameter 9 efficiently. However, in Figure [4] we observe that the 
determinant of I^ 1 (9) varies exponentially with 9 over 5 orders of magnitude. The determinant of 
1(9) is of course a measure of the amount of information available around 9; its value is very small 
for some parameters 9 and this leads to a perturbation kernel with too large a covariance. On the 
other hand, if the determinant of 1(9) is large, additional information may be gained by moving only 
in the direct vicinity of 9, and a perturbation kernel based on the inverse of the FIM explores only the 
immediate neighbourhood of the parameter. We therefore consider two versions of the multivariate 
normal perturbation kernel based on the FIM: the first consists of normalizing the matrix such that 
its determinant is equal to the determinant of the covariance matrix of the previous population 



( det(sW) 
ldet(/-i(fl)) 



l/d 



where is the matrix defined equation (10), and the second consists of normalizing the matrix such 

based on the M nearest 



,(') 



that its determinant is equal to the determinant of the covariance matrix T, e ' M 
neighbours of the particle defined according to 



I det(S^)) 
'' M IdetCl" 1 ^)) 



l/d 



r\9) . 



The parameter M may, for instance, be equal to 20% of the previous population 
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Figure 4: Logarithm of the determinant of I l {9) rescaled to be in [0,1] for each particle 6. The 
maximum value of I~ l {6) over the population is equal to 85745 and the minimum one is equal to 0.14. 

5 Numerical results 

We first apply the ABC SMC algorithm with different kernels to three illustrative examples, which 
exhibit certain pathological features that highlight differences between the perturbation kernels con- 
sidered here. 

5.1 Ellipsoid shape 

We begin with a toy example where the prior distribution of the two dimensional parameter is a 
uniform distribution on the square [—50, 50] x [—50, 50] and the likelihood function is given by 

x~N({9 1 - 26 2 ) 2 + (e 2 -4) 2 ,l) . 

It is assumed that x = is observed. The posterior density is then 

P (0\x) oc 0(0; (0x - 28 2 f + (6 2 - 4) 2 , l)lr 

-50,50] x [-50,50] 

where 4>(x;[i,a 2 ) is the one dimensional normal density with mean \x and variance a 2 , and is repre- 
sented in Figure [5| The ABC SMC algorithm is used to estimate p e (6\x) for N = 800 particles, e = 1 
and the decreasing sequence of (e t )i< t <T equal to (160, 120, 80, 60, 40, 30, 20, 15, 10, 8, 6, 4, 3, 2, 1). We 
compare 5 different perturbation kernels: the uniform kernel, the component-wise normal kernel, the 
multivariate normal kernel with the covariance matrix computed from the whole previous popula- 
tion, the multivariate normal kernel whose covariance matrix is computed according to the M nearest 
neighbours of each particle, with M = 50, and the multivariate normal kernel with OLCM. 
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Figure 5: Scatter plot and marginal posterior distribution for an ellipsoid posterior. Top left and 
bottom right: the marginal posterior distributions; Top right and bottom left: each population of 
particles for all the values of e in the parameter space. 



Figure [6] shows that the acceptance rate differs significantly between kernels. The uniform kernel 
has the same acceptance rate as the component-wise normal kernel. Given the shape of the posterior 
distribution, it is easy to understand that a multivariate normal kernel results in a larger acceptance 
rate than other kernels. Since the two components of the parameters are strongly correlated using 
an estimate of the covariance from the previous population instead of an estimation of only the 
component-wise variances makes a marked difference on the acceptance rates. Both the multivariate 
normal kernel based on the 50 nearest neighbours and the one based on the OLCM result in acceptance 
rates over two times higher than component-wise kernels. 
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Figure 6: Ellipsoid Shape. Average of the acceptance rate over 10 independent runs for 5 different 
kernels: the uniform kernel (green), the component- wise normal kernel (blue), the multivariate normal 
kernel (black) , the multivariate normal kernel with 50 neighbours (red) , the multivariate normal kernel 
with OLCM (cyan). 
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5.2 Ring shape 






Figure 7: Scatter plot and marginal posterior distribution for a ring-shaped posterior. Top left and 
bottom right: the marginal posterior distributions; Top right and bottom left: each population of 
particles for all the values of e in the parameter space. 

In the second toy example, the prior distribution of the two dimensional parameter is still a uniform 
distribution on the square [—50, 50] x [—50, 50] but the likelihood function is 

x ~ M (0? + 6*1,0.5) . 

Again we assume that x = is observed; the posterior density is then 

p(e|x)oc0(O;e? + ^,O.5)l [ _ 5O) 5O]x[-5O,5O]W • 

As in the previous example we used the ABC SMC algorithm with N = 800 particles and the decreasing 
sequence of thresholds is equal to (160,120,80,60,40,30,20,15,10,8,6,4,3,2,1). We compare the 
same 5 perturbation kernels. 

The posterior distribution, represented by Figure[7J has a ring shape centred around 0. In this case, 
in contrast to the previous example, the multivariate normal perturbation kernel using an estimation 
of the covariance based on the previous population, as well as the OLCM version of it, have an 
acceptance rate similar to the component-wise normal perturbation kernel. Indeed, in this example, 
the correlation between the two parameters, 8\ and 82, at the whole population level is weak. This 
kind of shape needs a more local perturbation kernel in order to obtain higher acceptance rates. This 
is the case for the perturbation kernel based on the covariance matrix computed from the 50 nearest 
neighbours. 
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Figure 8: Ring Shape. Evolution of the acceptance rate for 10 independent runs and for 5 different 
kernels: the uniform kernel (green), the gaussian kernel (blue), the multivariate normal kernel (black), 
the multivariate normal kernel with 50 neighbours (red), the multivariate normal kernel with OLCM 
(cyan). 
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5.3 Banana shape 






Figure 9: Scatter plot and marginal posterior distribution for a banana-shaped posterior. Top left 
and bottom right: the marginal posterior distributions; Top right and bottom left: each population 
of particles for all the values of e in the parameter space. 

In the third toy example, the prior distribution of the two dimensional parameter is still a uniform 
distribution on the square [—50, 50] x [—50, 50] but the observation is two dimensional and the likelihood 
function is 

It is assumed that x = (0, 0) is observed. The posterior density is then 

p(e\x) ° C$ (( ) ; ( ^+^2 ) ,( J ^ )) 1 [-50,50] x [-50,50] (*) 

where $(ar;w,S) is the multi-dimensional normal density of mean v and covariance S. We use the 
same ABC SMC settings and again compare the 5 perturbation kernels, as well as two versions of the 
multivariate normal perturbation kernel where the covariance matrix is proportional to the inverse of 
the FIM. Here the FIM is exactly computable: 




When 62 = 0, we replace it by a very small value, 10 -4 such that 1(8) is no longer singular; safeguarding 
against singular FIMs is straightforward and a sensible precaution when running such algorithms 
without manual intervention. 

The posterior distribution is represented in Figure [9j As in the ring example, the multivariate 
normal perturbation kernel using the full estimated covariance of the whole previous population has 
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an acceptance rate similar to the component-wise normal perturbation kernel with adaptive estimation 
of the variances (see Figure 10). The multivariate normal kernel with OLCM obtains slightly better 
results. The two versions of the perturbation kernel based on the FIM have significantly different 
acceptance rates. The most efficient version is the one using the M nearest neighbours, as might be 
expected. However this kernel, as the multivariate normal kernel based on the M nearest neighbours, 
can show undesirable dependence on the chosen value of M. Figure 10 represents the acceptance rate 
evolution for this last perturbation kernel with different values of M. The acceptance rate diminish 
considerably as M increases and the kernel becomes less "locally aware" . 




Figure 10: Banana Shape. Evolution of the acceptance rate for 10 independent runs; each colour 
corresponds to a kernel. Left: the uniform kernel (green), the component- wise normal kernel (blue), 
the multivariate normal kernel (black), the multivariate normal kernel with 50 neighbours (red), the 
multivariate normal kernel with OLCM (cyan), the multivariate normal kernel based on the FIM 
version 1 (dashed magenta) and version 2 (dotted magenta); Right: multivariate normal kernels based 
on the M neighbours for M G {50, 100, 200, 300, 400, 500, 600, 700, 800} (from red to yellow) and the 
multivariate kernel with an estimated covariance based on the whole population (black) . 
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6 Real Application: The Repressilator Model 
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Figure 11: Scatter plot and marginal posterior distribution. On the diagonal: the marginal posterior 
distributions; Other plots: each population of particles for all the values of e projected into two 
components of the parameter space. 



To analyse the differences between the efficiency of the perturbation kernel in a real application we 



focus on the repressilator model, a popular model for gene regulatory systems (Elowitz Sz Leibler 



2000). This model also exemplifies the challenges that are frequently encountered in attempts to 



reverse engineer the structure and parameters of dynamical systems from data ( Toni et al. 2009 



Girolami & Calderhead, 2011). There are now several studies that appear to demonstrate that, even 



for large data-sets, only about a third of parameters of dynamical systems can be inferred with high 



confidence (or high posterior probability) ( Gutenkunst et al. , 2007 ; Rand , 2008 Erguler & Stumpf 



2011); there are also signs, however, that judiciously chosen experimental conditions can lead to an 



increased information content in the data (Casey et al, 2007 Apgar et al, 2008). 



The model is described by six ordinary differential equations and a four dimensional parameter 



19 




Figure 12: Cumulative number of simulation over the iterations of the algorithm (left) and total 
of the computational time to generate N accepted particles with different kernels normalized by the 
required time using the uniform kernel (right) for the following kernels: the Gaussian kernel (blue), 
the multivariate normal kernel (black), the multivariate normal kernel with 50 neighbours (red), the 
multivariate normal kernel with OLCM (cyan), the multivariate normal kernel based on the FIM 
version 1 (dashed magenta) and version 2 (dotted magenta) 



vector, 6 = (ao,n, [3, a), 



dm\ a 
—— = -mi + - + a 
at 1 + 

dpi 

dt 

drri2 a 

= —mo H 

dt 1 + v \ 

dp2 
dt 

dm% a 

-dT = - m3 + TT P " 2 

dp3 
dt 



(3(pi - mi) 
a 

(3(p 2 - m 2 ) 

a 
1 +p 

/3(p 3 - m 3 ) 



For the simulated data, the initial condition are (mi,pi, m2,P2, «i3,£>3) = (0, 2, 0, 1, 0, 3) and the time 
points at which we observe the quantities are (0.0, 0.6, 4.2, 6.2, 8.6, 13.4, 16, 21.4, 27.6, 34.4, 39.8, 40.6, 
45.2). We assume that only the time series of mi, m 2 and 7713 (which correspond to mRNA measure- 
ments) are observed. 

The ABC SMC algorithm is used to estimate p(9\{mi(k),m2{k), m^{k)}k) with iV = 800 particles 
and a decreasing sequence of thresholds equal to (160, 150, 140, 130, 120, 100, 80, 50, 40, 37, 35, 33, 31, 
29, 27, 25, 23, 21, 20). The posterior distribution is represented in Figure 11 and agrees very well with 
what is known from previous studies (Toni et al. , 2009). In Figure [T2| (left), we compare the cumulative 
number of sampled data over the algorithm for different perturbation kernels. Using the uniform kernel, 
up to 6.5 x 10 6 simulations are required to obtain an approximation of the posterior distribution whereas 
only 1.6 x 10 5 simulations are required if the second version of the multivariate normal kernel based on 
the FIM is used. In order to contrast the efficiencies of these kernels further we show the time required 
to generate a sample of iV particles of the posterior p tt (-\x), where x = ({mi(k) , m2(k) , m^{k)}k)^ as a 
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function of t for all kernel scaled by the time required for the uniform kernel in Figure 12 (right). Using 
the FIM results in a 23-fold speed-up compared to the uniform kernel. The time spent to evaluate 
the FIM for each parameter is small compared to the time saved by sampling new parameters more 
efficiently So even in this simple model, a significant improvement is possible through appropriate 
choice of perturbation kernel. 



7 Conclusion 

In contrast to MCMC, where the pivotal role of perturbation kernels for convergence and mixing 
has been well documented, for ABC SMC approaches there has been comparatively little work. In 
particular in the ABC context, which often relies on computationally costly simulation routines, 
poor choice of the perturbation kernel will result in potentially prohibitive computational overheads. 
We have addressed this lack of suitable kernels here in a rigorous but non-exhaustive fashion by 
focusing on kernels that are based around uniform or normal/multivariate normal parametric families. 
Importantly, in all examples we were able to ensure that the different kernels had arrived at essentially 
identical posterior distributions, and for fixed et schedule we can use the acceptance rate as an objective 
criterion for the numerical efficiency of different kernels. 

For all these models it is relatively straightforward to construct optimality criteria by reference 



to the KL divergence following Beaumont et al. (2009). In higher-dimensional parameter spaces it is 



important to take into account the potentially correlated nature of parameters, and, not surprisingly 
we find that component- wise perturbation of particles tends to perform poorly compared to the other 
approaches considered here. In more complicated cases, e.g. decidedly non-Gaussian posteriors, 
multimodal posteriors, or posteriors with ridges, we find that a straightforward multivariate normal 
kernel is in turn inferior to kernels that are conditioned on the local environment of a particle. 

In most applications of interest, the computational cost of simulating the data overtakes the algo- 
rithmic complexity 0(N 2 ) of the ABC SMC scheme. We therefore argue that the choice of a kernel 
with a high acceptance rate enables users to optimize the computational cost. However, when two 
kernels have the same acceptance rate — which may happen for some shapes of the posterior — it is 
more appropriate to select the one which is cheaper in term of algorithmic complexity. The following 
table summarizes the computational cost of implementing the proposed perturbation kernels from a 
previous population of N particles with dimension d (the number of individual parameters). In the 
case of the multivariate normal kernel based on the FIM, we denote by C the computational cost of 
simulating an observation, e.g. by solving the set of ODEs or SDEs which define the generative model. 



Component-wise normal 0(dN 2 ) 

Multivariate normal based on the whole previous population 0(d 3 N 2 ) 

Multivariate normal based on the M nearest neighbours 0(dN 2 + d?M 2 N) 

Multivariate normal with OLCM 0(d 3 N 2 ) 

Multivariate normal based on the FIM 0(dCN + d 3 N 2 ) 
(normalized with entire population) 



As a general rule of thumb we would recommend the use of multi-variate kernels with OLCM which 
tends to have the highest acceptance rate in our toy-examples and is relatively easy to implement at 
acceptable computational cost. Where applicable we would advocate the use of the FIM in order 
to perturb particles in a rational way, by which we mean, taking large steps in directions where 
the information is flat, and small steps where the information changes more quickly. This will have 
the advantage of exploring so-called neutral spaces more efficiently while maintaining the acceptance 
rate. But here again there is a potential tension between global and local aspects: if we take the 
global FIM as the basis for the perturbation we may have a very poor representation of the internal 



21 



structure of e.g. the posterior (very much as might happen in the Laplace expansion); if, on the other 
hand, we evaluate the FIM based on the M nearest neighbours of a particle, then we may overly 
restrict the particles making up the next intermediate distribution. For some probability models, 
in particular those describing dynamical systems, the FIM has attracted a lot of attention recently 



(Amari, 2007; Arwini et al. 2008 Secrier et al. , 2009 Girolami & Calderhead, 2011 Erguler & Stumpf 



2011 ; Komorowski et al. , 2011 ), and it appears likely that we will be able to exploit these notions, and 
those of information geometry more generally, fruitfully in ABC SMC. 

The cost of local measures based on M nearest neighbours may seem too high to contemplate the 
use of such measures. Given the increase in acceptance rate that we have observed this matters very 
little, however, especially when the computational cost to generate simulated data from a candidate 
particle is high. 

To some, especially those from a background in evolutionary computation, the kernels discussed 
here may seem restrictive. We may, for example, wish to consider other perturbations to generate new 



candidate particles, such as recombination etc (Baragona & Battaglia, 2010), as is frequently done 



in global optimization. In principle it is possible to include this in ABC SMC approaches, as long 
as the weights for new particles can be calculated (which turns out to be relatively straightforward 
for recombination and different cross-over schemes). It has to be kept in mind, however, that these 
perturbations work best in cases where the parameter space is so under-sampled that random combi- 
nations of individual parameters are sufficiently likely to end up in a region with a more favourable 
cost-function than a local, e.g. gradient-based proposal would. While such strategies have been ap- 
plied in many optimization settings, their use in Bayesian inference is rare, since generally here the 
optimum (by whichever criterion) parameter value is of less interest than the distribution as a whole. 
For maximum a posteriori inferences such methods may be fruitfully applied, but here we do not see 
an obvious advantage (as is also borne out by simulation studies, data not shown). 

Kernel choice is one of the obvious means of speeding up ABC SMC inferences. Setting the e 
schedule optimally another. The latter is straightforwardly automated by basing the next et+i on the 
acceptance rate obtained during the generation of the intermediate distribution, p et {6\x). But again 
there is a trade-off to be made between convergence and exhaustive exploration of the parameter 
space. In particular too gentle a decrease in et may result in loss of particle diversity (in a process 
mimicking drift in population genetics). Here we believe that further investigation of FIMs may hold 
important clues to how the et are best chosen. This would, for example, resonate with the perspective 



on e proposed by Ratmann et al. (2009). 
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